(54) Grouping words with equivalent substrings by automatic clustering based on suffix
relationships

(57) A set of words of a natural language is grouped 
by automatically obtaining suffix relation data that indi- 
cate a relation value for each of a set of relationships 
between suffixes that occur in the natural language, 
and, then, by automatically clustering the words in the 
set using the relation values from the suffix relation 
data, to obtain group data indicating groups of words. 
Two or more words in a group have suffixes as in one of 
the relationships and, preceding the suffixes, equivalent 
substrings. The relationships can be pairwise relation- 
ships, and the relation value can indicate the number of 
occurrences of a suffix pair. The suffix relation data can 
be obtained using an inflectional lexicon. Complete link 
clustering can be used.

Description

Field of the Invention 

[0001] The invention relates to grouping words,
some of which include equivalent substrings.

Background and Summary of the Invention 

[0002] Some conventional techniques group related
words using some type of manual choice. For example,
Debili, F., Analyse Syntaxico-Semantique Fondee Sur
Une Acquisition Automatique de Relations Lexicales-
Semantiques, Doctoral Thesis, Univ. Paris XI, Jan. 26,
1982, pp. 174-223, discloses such a technique to obtain
families of words. The Debili thesis discloses that each
word is truncated by removing suffixes (and possibly
prefixes) to obtain a stem ("radical"). Binary correlation
matrices of suffixes are created by examining an
automatically produced correlation matrix of suffixes and
manually produced compatibility and incompatibility
matrices of suffixes, and are corrected by inserting a
zero for suffixes that are not compatible. The suffix
matrices and the stems can then be used to automatically
obtain families of words, with certain manual
corrections.

[0003] In contrast, other conventional techniques
automatically group words without manual intervention.
Adamson, G., and Boreham, J., "The use of an Association
Measure Based on Character Structure to Identify
Semantically Related Pairs of Words and Document
Titles", Information Storage and Retrieval, Vol. 10,
1974, pp. 253-260, disclose such an automatic word
classification technique based on comparison of pairs
of consecutive characters, called digrams. The technique
computes a similarity coefficient between pairs of
words based on the number of digrams common to the
words and on the sum of the total numbers of digrams in
the words, to obtain a matrix of similarity coefficients for
all pairs of words. The matrix is then used to cluster the
words by the method of single linkage, to produce a
numerically stratified hierarchy of clusters.
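
As an illustration only, a digram similarity coefficient of this kind can be sketched in Python. The sketch assumes a Dice-style form (twice the number of shared digrams divided by the sum of the digram counts) and assumes that each distinct digram is counted once per word; the text above states only that the coefficient depends on those two quantities, and the function names are hypothetical.

```python
def digrams(word):
    """Return the set of distinct pairs of consecutive characters in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def digram_similarity(w1, w2):
    """Dice-style coefficient (assumed form): 2 * shared / (count1 + count2)."""
    d1, d2 = digrams(w1), digrams(w2)
    total = len(d1) + len(d2)
    if total == 0:
        return 0.0  # neither word is long enough to contain a digram
    return 2 * len(d1 & d2) / total


# "statistics" has 7 distinct digrams, "statistical" has 8, and 6 are shared.
print(digram_similarity("statistics", "statistical"))  # 0.8
```

A matrix of such coefficients over all word pairs would then be the input to single linkage clustering.
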
[0004] Lennon, M., Peirce, D.S., Tarry, B.D., and
Willett, P., "An evaluation of some conflation algorithms
for information retrieval", Journal of Information
Science, Vol. 3, 1981, pp. 177-183, describe stemming
algorithms that reduce all words with the same root to a
single form by stripping each word of its derivational and
inflectional affixes. If prefixes are not removed, the
procedure conflates all words with the same stem. Lennon
et al. describe an evaluation to determine whether the
reduction in implementation costs for machine processing
algorithms is achieved at the expense of a decrease
in conflation performance, when compared to algorithms
based on manual evaluation of possible suffixes.
They conclude that there is relatively little difference
despite the different ways algorithms are developed,
and that simple, fully automated methods perform as
well for English language information retrieval as
procedures which involve a large degree of manual
involvement in their development.

[0005] The invention addresses a basic problem 
that arises in grouping related words. Conventional 
techniques exhibit a tension between accuracy and 
speed: Manual techniques can be used to group words 
very accurately, but are complex and tedious. Automatic 
techniques, on the other hand, can be very fast, but pro- 
duce groupings that are not as generally accurate as 
can be obtained manually. 

[0006] The invention is based on the discovery of a 
new automatic technique for grouping words that allevi-
ates the tension between accuracy and speed. The new
technique automatically obtains suffix relation data indi- 
cating a relation value for each of a set of relationships 
between suffixes that occur in a natural language; the 
relation value for a relationship could, for example, be its 
frequency of occurrence in a set of words from the nat- 
ural language. The new technique then performs auto- 
matic clustering of a set of words using the relation 
values from the suffix relation data, to obtain groups of 
words, where two or more words in a group have suf- 
fixes as in one of the relationships and, preceding the 
suffixes, equivalent substrings. 
[0007] The new technique can be implemented for 
pairwise relationships between suffixes, with the rela- 
tion value of each suffix pair being the number of pairs 
of words that are related to each other by the pair of suf- 
fixes. Automatic clustering can then be performed with 
the pairwise similarity between words being the greatest 
relation value of the suffix pairs, if any, that relate the 
words to each other. Complete link clustering can be 
used. The new technique can be implemented using a 
lexicon, such as an inflectional lexicon, to automatically 
obtain a word list and then to use the word list in auto-
matically obtaining suffix pair data. The suffix pair data
can indicate pairs of suffixes that relate words to each
other and, for each pair of suffixes, a relation value indi-
cating a number of times the suffix pair occurs in the
word list. The suffix pair data can further indicate, for
each suffix in a pair, a part of speech, and the relation 
value can accordingly indicate the number of times the 
suffixes in the suffix pair occur in the word list with the 
indicated parts of speech. 
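
The pairwise similarity just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: it treats two words as related by a suffix pair only when the substrings preceding the suffixes are identical (the narrowest case of equivalence), it ignores parts of speech, and the relation values in the example are invented for illustration.

```python
def suffix_pairs_relating(w1, w2):
    """Yield each (s1, s2) with w1 == stem + s1 and w2 == stem + s2, stem non-empty."""
    for p in range(1, min(len(w1), len(w2)) + 1):
        if w1[:p] != w2[:p]:
            break  # the shared leading substring ends here
        yield (w1[p:], w2[p:])


def pairwise_similarity(w1, w2, relation_values):
    """Greatest relation value among the suffix pairs relating w1 and w2, else 0."""
    best = 0
    for s1, s2 in suffix_pairs_relating(w1, w2):
        # A suffix pair is unordered, so look it up in both directions.
        best = max(best,
                   relation_values.get((s1, s2), 0),
                   relation_values.get((s2, s1), 0))
    return best


# Hypothetical relation values (number of word pairs related by each suffix pair).
relation_values = {("e", "ation"): 120, ("", "s"): 900}
print(pairwise_similarity("derive", "derivation", relation_values))  # 120
print(pairwise_similarity("cat", "dog", relation_values))            # 0
```
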

[0008] A representative for each group of words
indicated by the group data can also be automatically
obtained, such as the shortest word in the group.
Further, a data structure, such as a finite state transducer
(FST), can be automatically produced that can be
accessed with a word in a group to obtain the group's
representative. The data structure can also be
accessed with a group's representative to obtain a list of
the words in the group.

[0009] The new technique can further be implemented
in a system that includes memory and a processor
that automatically obtains the suffix relation data
and automatically clusters the set of words to obtain the
group data, storing the suffix relation data and the group
data in memory. The processor can also automatically
produce a data structure as described above and provide
it to a storage medium access device for storage on
a storage medium or to another machine over a
network.



[0010] In comparison with conventional techniques
for grouping words with manual choice or other manual
involvement, the new technique is advantageous
because it is automatic and therefore can be performed
quickly. In addition, with appropriate clustering
techniques, the new technique can approach the accuracies
obtainable with manual techniques.

[0011] In comparison with conventional automatic
techniques for grouping words, the new technique is
significantly more accurate. Indeed, when applied to the
problem of stemming, the new technique is also more
accurate than conventional semi-automatic stemmers
that rely on a list of suffixes, a set of rules, and a list of
exceptions.

[0012] The new technique is also advantageous
because it can be readily applied to additional
languages for which inflectional lexicons are available. A
language's inflectional lexicon can be used to
automatically obtain suffix pairs with relation values.

[0013] The new technique is also advantageous
because it can be implemented to use complete words
as group representatives. In comparison with
techniques that use substrings to represent groups, this is
advantageous because it avoids ambiguous
representatives that could represent more than one group.

[0014] The following description, the drawings, and
the claims further set forth these and other aspects,
objects, features, and advantages of the invention.

Brief Description of the Drawings

[0015]

Fig. 1 is a schematic flow diagram showing how
word groups can be obtained using suffix relation
data.

Fig. 2 is a flow chart showing general acts in
obtaining word groups by automatically obtaining suffix
relation data and by automatically clustering a set
of words.

Fig. 3 is a schematic diagram showing components
of a system that can perform the general acts in Fig. 2.

Fig. 4 is a schematic diagram of a system in which
the general acts in Fig. 2 have been implemented.

Fig. 5 is a flow chart showing how the system of Fig.
4 implements acts as in Fig. 2.

Fig. 6 is a flow chart showing in greater detail how
suffix pairs are obtained in Fig. 5.

Fig. 7 is a flow chart showing in greater detail how
words are clustered in Fig. 5.

Detailed Description of the Invention

A. Conceptual Framework

[0016] The following conceptual framework is helpful
in understanding the broad scope of the invention,
and the terms defined below have the indicated meanings
throughout this application, including the claims.

[0017] The term "data" refers herein to physical signals
that indicate or include information. When an item
of data can indicate one of a number of possible
alternatives, the item of data has one of a number of "values".
For example, a binary item of data, also referred to as a
"bit", has one of two values, interchangeably referred to
as "1" and "0" or "ON" and "OFF" or "high" and "low".

[0018] The term "data" includes data existing in any
physical form, and includes data that are transitory or
are being stored or transmitted. For example, data could
exist as electromagnetic or other transmitted signals or
as signals stored in electronic, magnetic, or other form.

[0019] "Circuitry" or a "circuit" is any physical
arrangement of matter that can respond to a first signal
at one location or time by providing a second signal at
another location or time. Circuitry "stores" a first signal
when it receives the first signal at one time and, in
response, provides substantially the same signal at
another time. Circuitry "transfers" a first signal when it
receives the first signal at a first location and, in
response, provides substantially the same signal at a
second location.

[0020] A "data storage medium" or "storage
medium" is a physical medium that can store data.
Examples of data storage media include magnetic
media such as diskettes, floppy disks, and tape; optical
media such as laser disks and CD-ROMs; and
semiconductor media such as semiconductor ROMs and RAMs.
As used herein, "storage medium" covers one or more
distinct units of a medium that together store a body of
data. For example, a set of diskettes storing a single
body of data would together be a storage medium.

[0021] A "storage medium access device" is a
device that includes circuitry that can access data on a
data storage medium. Examples include drives for
accessing magnetic and optical data storage media.

[0022] "Memory circuitry" or "memory" is any
circuitry that can store data, and may include local and
remote memory and input/output devices. Examples
include semiconductor ROMs, RAMs, and storage
medium access devices with data storage media that
they can access.

[0023] A "data processor" or "processor" is any
component or system that can process data, and may
include one or more central processing units or other
processing components.

[0024] A processor performs an operation or a
function "automatically" when it performs the operation or
function independent of concurrent human intervention
or control.

[0025] Any two components are "connected" when
there is a combination of circuitry that can transfer
signals from one of the components to the other. For
example, two components are "connected" by any
combination of connections between them that permits
transfer of signals from one of the components to the
other.

[0026] A "network" is a combination of circuitry
through which a connection for transfer of data can be
established between machines. An operation
"establishes a connection over" a network if the connection
does not exist before the operation begins and the
operation causes the connection to exist.
[0027] A processor "accesses" an item of data in
memory by any operation that retrieves or modifies the
item or information within the item, such as by reading
or writing a location in memory that includes the item. A
processor can be "connected for accessing" an item of
data by any combination of connections with local or
remote memory or input/output devices that permits the
processor to access the item.
[0028] A processor or other component of circuitry
"uses" an item of data in performing an operation when
the result of the operation depends on the value of the
item.

[0029] A processor accesses a first item of data
"with" a second item of data if the processor uses the
second item of data in accessing the first, such as by
using the second item to obtain a location of the first
item of data or to obtain information from within the first
item of data.

[0030] To "obtain" or "produce" an item of data is to
perform any combination of operations that begins
without the item of data and that results in the item of data.
To obtain a first item of data "based on" a second item
of data is to use the second item to obtain the first item.
[0031] An item of data "indicates" a thing, event, or
characteristic when the item has a value that depends
on the existence or occurrence of the thing, event, or
characteristic, or when information about the thing, event,
or characteristic can be obtained by operating on the item
of data. An item of data "indicates" another value when
the item's value is equal to or depends on the other
value.

[0032] An operation or event "transfers" an item of
data from a first component to a second if the result of
the operation or event is that an item of data in the
second component is the same as an item of data that was
in the first component prior to the operation or event.
The first component "provides" the data, and the
second component "receives" or "obtains" the data.
[0033] A "natural language" is an identified system
of symbols used for human expression and communication
within a community, such as a country, region, or
locality or an ethnic or occupational group, during a
period of time. Some natural languages have a standard
system that is considered correct, but the term "natural
language" as used herein could apply to a dialect,
vernacular, jargon, cant, argot, or patois, if identified as
distinct due to differences such as pronunciation, grammar,
or vocabulary.

[0034] A "natural language set" is a set of one or 
more natural languages. 

[0035] "Character" means a discrete element that
appears in a written, printed, or phonetically transcribed
form of a natural language. Characters in the present
day English language can thus include not only
alphabetic and numeric elements, but also punctuation
marks, diacritical marks, mathematical and logical
symbols, and other elements used in written, printed, or
phonetically transcribed English. More generally,
characters can include, in addition to alphanumeric
elements, phonetic, ideographic, or pictographic elements.
[0036] A "word" is a string of one or more elements, 
each of which is a character or a combination of charac- 
ters, where the string is treated as a semantic unit in at 
least one natural language. A word "occurs" in each lan-
guage in which it is treated as a semantic unit.
[0037] A "lexicon" is used herein to mean a data
structure, program, object, or device that indicates a set 
of words that may occur in a natural language set. A lex- 
icon may be said to "accept" a word it indicates, and 
those words may thus be called "acceptable" or may be 
referred to as "in" or "occurring in" the lexicon. 
[0038] As used herein, an "inflectional lexicon" is a
lexicon that, when accessed with a correctly inflected
input word, provides access to a lemma or normalized
dictionary-entry form of the input word. An inflectional
lexicon typically includes one or more data structures
and a lookup routine for using the input word to access
the data structures and obtain the lemma or an output
indicating the input word is unacceptable.
[0039] A "prefix" is a substring of characters occur- 
ring at the beginning of a word, and a "suffix" is a sub- 
string of characters occurring at the end of a word. 
[0040] A suffix "follows" a substring in a word and 
the substring "precedes" the suffix if the last character 
of the substring immediately precedes the first charac- 
ter of the suffix. 

[0041] A "relationship" between suffixes refers to 
the occurrence in a natural language set of a set of 
words that are related but that have different suffixes, 
which are thus "related suffixes". A "pairwise relation- 
ship" is a relationship between two suffixes. A relation- 
ship between suffixes "occurs" when a natural language 
set includes a set of related words, each of which has 
one of the suffixes. If a part of speech is also indicated 
for each suffix, the relationship only "occurs" if the 
related word that has a suffix also has the indicated part 
of speech. 

[0042] Substrings that precede related suffixes in a
set of different words are "equivalent" if the words are all
related because of a relationship between the sub- 
strings. For example, it is conventional to make minor 
graphical changes in a substring that precedes a suffix 
during inflectional changes, such as by adding or delet- 
ing a diacritical mark or otherwise changing a character 
to indicate a change in pronunciation or by changing 
between a single character and a doubled character. 
Substrings that precede suffixes may also be equivalent 
because they are phonetic alternatives, because of a 
historical relationship through which one developed 
from the other or both developed from a common 
ancestor, because they are cognates in two different 
languages, or because of any of various other relation-
ships.
[0043] The "frequency of occurrence" of a suffix 
relationship in a set of words is the number of different 
subsets of words in the set that are related by the suffix 
relationship. 

[0044] A set of suffixes "relates" a set of words if 
each of the words can be obtained from any other in the 
set by a process that includes removing one of the suf- 
fixes and adding another of the suffixes. The process 
may also include other modifications, such as to a sub-
string preceding the suffix or to a prefix that precedes
the substring, but if there are no such other modifica- 
tions, a prefix that includes the substring "occurs" with 
each of the suffixes in the set to form the set of related 
words. 

[0045] A "clustering" is an operation that groups 
items based on similarity, association, or another such 
measure. To "cluster" is to perform a clustering. 
[0046] A "pairwise similarity" is an item of data indi- 
cating a measure of similarity between two items. 
[0047] A finite state transducer (FST) is a data 
processing system having a finite number of states and 
transitions (or arcs), each transition originating in a state 
and leading to a state, and in which each transition has 
associated values on more than one level. As a result, 
an FST can respond to an input signal indicating a value
on one of the levels by following a transition with a
matching value on the level and by providing as output
the transition's associated value at another level. A two-
level transducer, for example, can be used to map
between input and output strings, and if the values are
character types, the input and output strings can be 
words. 
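
The behavior of a two-level transducer can be illustrated with a toy Python sketch. The states, transitions, and word forms below are invented for illustration, and the representation (a dictionary from states to lists of transitions) is only one of many possible FST data structures. Each transition carries one character on the input level and a character, or nothing, on the output level.

```python
# Each transition is (input value, output value, target state).
# An empty output value means the transition emits nothing on the output level.
TRANSITIONS = {
    0: [("w", "w", 1)],
    1: [("a", "a", 2)],
    2: [("l", "l", 3)],
    3: [("k", "k", 4)],
    4: [("s", "", 5), ("e", "", 6)],
    6: [("d", "", 5)],
}
FINAL_STATES = {4, 5}


def transduce(word, transitions=TRANSITIONS, finals=FINAL_STATES, start=0):
    """Return every output string that the transducer pairs with the input word."""
    results = []
    stack = [(start, 0, "")]  # (state, position in input, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(word) and state in finals:
            results.append(out)
        for upper, lower, target in transitions.get(state, []):
            # Follow a transition whose input-level value matches the next character.
            if pos < len(word) and word[pos] == upper:
                stack.append((target, pos + 1, out + lower))
    return sorted(results)


print(transduce("walked"))  # ['walk']
print(transduce("walker"))  # []
```

Reading the same transitions with the two levels exchanged would map in the opposite direction, which would require transitions with empty input values and is omitted here for brevity.
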

[0048] A "finite state transducer data structure" or
"FST data structure" is a data structure containing infor-
mation sufficient to define the states and transitions of
an FST. 

[0049] The term "word list" is used herein in the 
generic sense of a data structure that indicates a set of 
words. The data structure could, for example, be a finite 
state machine (FSM) data structure, an FST data struc- 
ture, a list data structure, or any other appropriate type 
of data structure. 

[0050] A "representative" of a group is an item of
data that is unique to the group so that it can be used to
represent the group. A representative may be one of the
members of a group of items of data or it may be an item
of data obtained in some other way.

B. General Features 

[0051] Figs. 1-3 illustrate general features of the 
invention. 

[0052] Fig. 1 is a flow diagram that shows
schematically how word groups can be obtained. In Fig. 1, the
boxes at left represent external input to word grouping,
the boxes in the center represent operations performed
during word grouping, and the boxes at right represent
intermediate and final word grouping results.

[0053] The input in box 10 provides information
about a natural language set from which the operation
in box 12 can obtain suffix relation data, illustratively
shown as an intermediate result in box 14. As shown,
the suffix relation data include, for each suffix
relationship, a relation value; suffix relations A and B
illustratively have relation values a and b, respectively.
[0054] The suffix relation data in box 14 and a set of
words, word 1 through word M as shown in box 20, can
then be used by the operation in box 22, which clusters
the words using the relation values to obtain group data
indicating groups of the words (word group 1 through
word group N), illustratively shown as a final result in
box 24. As illustrated in box 24, a word group
(illustratively word group 1) can include two or more words that
have suffixes as in one of the relationships and,
preceding the suffixes, equivalent substrings.
[0055] In box 30 in Fig. 2, a technique automatically
obtains suffix relation data indicating a relation value for
each of a set of relationships between suffixes that
occur in a natural language set. Then, in box 32, the
technique automatically clusters a set of words that may
occur in the natural language set. Clustering in box 32
uses the relation values from the suffix relation data,
and obtains group data indicating groups of words,
where a group includes two or more words that have
suffixes as in one of the relationships and, preceding
the suffixes, equivalent substrings.
[0056] Machine 50 in Fig. 3 includes processor 52
connected for receiving information about a natural
language from a source 54 and also connected for
accessing data in memory 56 and for receiving instruction data
60 indicating instructions processor 52 can execute.
[0057] In executing the instructions indicated by
instruction data 60, processor 52 obtains suffix relation
data 62, which include, for each of a set of suffix
relationships, a relation value. Processor 52 then clusters a set
of words indicated by word set data 64 using relation
values from suffix relation data 62 to obtain word group
data 66, indicating groups of words, a group including
two or more words that have suffixes as in one of the
relationships and, preceding the suffixes, equivalent
substrings.
[0058] Fig. 3 illustrates three possible destinations
to which data output circuitry 70 could provide word
group data 66: memory 72, storage medium access
device 74, and network 76. In each case, word group
data 66 could be provided separately or as part of a
body of data that may also include instructions and
other data that would be accessed by a processor in
executing the instructions.

[0059] Memory 72 could be any conventional mem- 
ory within machine 50, including random access mem- 
ory (RAM) or read-only memory (ROM), or could be a 
peripheral or remote memory device of any kind. 
[0060] Storage medium access device 74 could be 
a drive or other appropriate device or circuitry for 
accessing storage medium 80, which could, for exam- 
ple, be a magnetic medium such as a set of one or more 
tapes, diskettes, or floppy disks; an optical medium
such as a set of one or more CD-ROMs; or any other
appropriate medium for storing data. Storage medium
80 could be a part of machine 50, a part of a server or
other peripheral or remote memory device, or a soft-
ware product. In each of these cases, storage medium
80 is an article of manufacture that can be used in a 
machine. 

[0061] Network 76 can provide word group data 66 
to machine 90. Processor 52 in machine 50 can estab- 
lish a connection with processor 92 over network 76 
through data output circuitry 70 and network connection 
circuitry 94. Either processor could initiate the connec- 
tion, and the connection could be established by any 
appropriate protocol. Then processor 52 can access 
word group data 66 stored in memory 56 and transfer 
the word group data 66 to processor 92 over network 
76. Processor 92 can store word group data 66 in mem- 
ory 94 or elsewhere, and can then access it to perform 
lookup. 

C. Implementation 

[0062] The general features described above could 
be implemented in numerous ways on various 
machines to obtain word groups. An implementation 
described below has been implemented on a Sun 
SPARC workstation running SunOS and executing
code compiled from C and Perl source code. 

C.1. Overview 

[0063] In Fig. 4, system 120 includes the central 
processing unit (CPU) 122 of a Sun SPARC worksta- 
tion, which is connected to display 124 for presenting 
images and to keyboard 126 and mouse 128 for provid- 
ing signals from a user. CPU 122 is also connected so 
that it can access memory 130, which can illustratively 
include program memory 132 and data memory 134. 
[0064] The routines stored in program memory 132
can be grouped into several functions: suffix pair
extraction routines 140, relational family construction routines
142, FST conversion routines 144, and lookup routines
146. Fig. 4 also shows several data structures stored in
data memory 134 and accessed by CPU 122 during
execution of routines in program memory
132: inflectional lexicon 150; word list 152; list 154,
listing suffix pairs with frequencies; list 156, listing
relational families of words; FST data structure 158; and
miscellaneous data structures 160. Inflectional lexicon
150 can be any appropriate lexicon for the language of
the words in word list 152, such as Xerox lexicons for
English or for French, both available from InXight
Corporation, Palo Alto, California. Word list 152 can be any
appropriate word list, such as the word lists that can be
extracted from the Xerox lexicons for English or for
French.

[0065] Fig. 5 illustrates high-level acts performed by 
processor 122 in executing some of the routines stored 
in program memory 132. 

[0066] In executing suffix pair extraction routines
140, processor 122 uses inflectional lexicon 150 to
automatically obtain word list 152, as shown in box 200.
Then processor 122 uses word list 152 to automatically
obtain list 154, which lists a set of suffix pairs and their
frequencies, as shown in box 202.

[0067] The acts in boxes 200 and 202 are thus one
way of implementing the act in box 30 in Fig. 2, and the
act in box 200 is optional because a word list could be
obtained in other ways. The suffix pairs can be viewed
as strings for making a transition between pairs of
related words in the word list, with one suffix being a
string that is removed from one word in the pair to obtain
a prefix and the other suffix being a string that is then
added to the prefix, after any appropriate modifications
in the prefix, to obtain the other word in the pair.
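
The view of a suffix pair as a transition between two related words can be sketched in a few lines of Python. This is an illustration only; the function name is hypothetical, and the sketch omits the optional modifications to the prefix mentioned above.

```python
def apply_suffix_pair(word, remove, add):
    """Remove one suffix of the pair from word and add the other.

    Returns None when word does not end with the suffix to be removed.
    """
    if not word.endswith(remove):
        return None
    prefix = word[:len(word) - len(remove)]  # also correct when remove is empty
    return prefix + add


print(apply_suffix_pair("deplorable", "able", "ingly"))   # deploringly
print(apply_suffix_pair("deploringly", "ingly", "able"))  # deplorable
```

Because either suffix of a pair can play the role of the removed string, the same pair supports the transition in both directions.
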

[0068] In executing relational family construction
routines 142, processor 122 then automatically clusters
a set of words from word list 152 using the suffix pair
frequencies from list 154, as shown in box 204. The set of
words can, for example, share a prefix. The result is list
156, which lists a set of word families, each of which is
a subset of the words clustered in box 204. As
suggested by the dashed line in Fig. 5, the act in box 204
can be repeated for each of a number of sets of words;
each set can, for example, include words that share a
specific substring. The act in box 204 is thus one way of
implementing the act in box 32 in Fig. 2. The word
families obtained in this manner are referred to as
"relational families" to distinguish them from conventional
derivational families, which they resemble.
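
Since complete link clustering is one clustering method that can be used, the act in box 204 can be illustrated with a generic agglomerative sketch in Python. Several points are assumptions made for illustration and are not taken from the description above: the stopping rule (a hypothetical `threshold` on the weakest link), the similarity table, and its values.

```python
def complete_link_clusters(words, similarity, threshold=1):
    """Agglomerative clustering in which two clusters are as similar as their
    least similar pair of members (complete link)."""
    clusters = [[w] for w in words]

    def link(a, b):
        return min(similarity(x, y) for x in a for y in b)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = link(clusters[i], clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        if score < threshold:
            break  # assumed stopping rule: no remaining pair is similar enough
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical pairwise similarities, e.g. greatest suffix pair frequencies.
table = {
    frozenset(("deploy", "deployment")): 50,
    frozenset(("deplore", "deplorable")): 40,
}


def sim(a, b):
    return table.get(frozenset((a, b)), 0)


print(complete_link_clusters(
    ["deploy", "deployment", "deplore", "deplorable"], sim))
# [['deploy', 'deployment'], ['deplore', 'deplorable']]
```

In this example all four words share the prefix "deplo", yet the complete link criterion keeps the two families apart because no suffix pair relates a member of one family to a member of the other.
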

[0069] In executing FST conversion routines 144,
processor 122 obtains FST data structure 158 that
provides an input-output pairing between each of the words
in a family and a representative of the family, as shown
in box 206. A family's representative could, for example,
be its shortest word. The act in box 206 is optional, and,
like the acts in boxes 200, 202, and 204, can be
performed automatically.

[0070] In executing lookup routines 146, processor
122 can provide input words to FST data structure 158
to obtain desired output such as a word family
representative or a list of the words in a family. The input
words can be received from keyboard 126 or can be
indicated by a selection from mouse 128, and the output
can be presented on display 124 as shown. For
example, FST data structure 158 can respond to an input
word by providing as output the representative of a
relational family from box 204 that includes the input word.
Conversely, FST data structure 158 can respond to the
representative of a relational family by providing as
output all the words in the relational family. The FST thus
facilitates rapid lookup of a representative of each
relational family or of the words of a relational family.
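
The two lookup directions can be illustrated with ordinary Python dictionaries standing in for FST data structure 158. This substitution is made only to keep the sketch short; it shows the input-output pairing, not how an FST stores it. The tie-breaking rule among equally short words (alphabetical order) is an assumption, as are the example families.

```python
def build_lookup(families):
    """Return (word -> representative, representative -> sorted family words)."""
    to_representative = {}
    to_family = {}
    for family in families:
        members = sorted(family)
        # Representative: the shortest word; ties broken alphabetically (assumed).
        representative = min(members, key=lambda w: (len(w), w))
        to_family[representative] = members
        for word in members:
            to_representative[word] = representative
    return to_representative, to_family


families = [
    {"deploy", "deploys", "deployment"},
    {"deplore", "deplorable", "deploringly"},
]
to_representative, to_family = build_lookup(families)
print(to_representative["deployment"])  # deploy
print(to_family["deplore"])  # ['deplorable', 'deplore', 'deploringly']
```
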
[0071] The implementation in Figs. 4 and 5 is thus
based on two intuitive premises: First, it should be
possible to automatically extract a set of suffixes from a
language's inflectional lexicon. Second, it should be
possible to automatically obtain information from a
language's inflectional lexicon about relationships between
suffixes that correspond to relationships between
families of words.

C.2. Suffix Pair Extraction 


[0072] Fig. 6 illustrates in more detail how the acts 
in boxes 200 and 202 in Fig. 5 can be implemented. 
[0073] The intuitive premise behind suffix extraction
as in Fig. 6 is that long words of a given language tend
to be obtained through derivation, and more precisely
through addition of suffixes, and thus long words can be
used to identify regular suffixes.
[0074] It is helpful to think of two words w1 and w2
of a given language as being p-similar if and only if two
conditions are met: the first p characters of w1 and w2
are the same, and the (p+1)th characters of w1
and w2 are not the same. The character strings s1 and
s2 that begin with the (p+1)th characters of w1 and w2,
respectively, can be referred to as pseudo-suffixes, and
either or both of s1 and s2 can be the empty string; both
can be empty if w1 and w2 differ only in their part of
speech. The pair (s1, s2) can be referred to as a
pseudo-suffix pair that links w1 and w2.
[0075] The notion of p-similarity can be understood
from examples: The English words "deplorable" and
"deploringly" are 6-similar, with (able, ingly) an English
pseudo-suffix pair that links them. Since "deplorable" is
an adjective and "deploringly" is an adverb, the
pseudo-suffix pair can be more precisely written (able+AJ,
ingly+AV), where +AJ stands for adjective and +AV for
adverb; this means that a transition from an adjective to
an adverb can be made by removing the string "able"
and adding the string "ingly", or vice versa.
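
The definition of p-similarity can be illustrated with a minimal Python sketch. The function name p_similarity is hypothetical and not part of the described implementation, which operates on finite state data structures; part-of-speech tags are ignored here for brevity.

```python
def p_similarity(w1, w2):
    """Return (p, (s1, s2)): the length p of the longest common prefix
    of w1 and w2, and the pseudo-suffix pair that follows it."""
    p = 0
    for c1, c2 in zip(w1, w2):
        if c1 != c2:
            break
        p += 1
    # Either pseudo-suffix may be the empty string.
    return p, (w1[p:], w2[p:])


print(p_similarity("deplorable", "deploringly"))
```

For "deplorable" and "deploringly" the sketch reports p = 6 and the pair (able, ingly), in line with the example above.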
[0076] The notion of p-similarity can be generalized
to the broader notion of p/q-equivalence, where two
words w1 and w2 of a given language set are
p/q-equivalent if the following conditions are met: The first p
characters of w1 and the first q characters of w2 are
equivalent and the (p+1)th character of w1 is not
equivalent to the (q+1)th character of w2. Under this
definition, p-similarity is a special case of p/q-equivalence in
which p=q and the first p characters of w1 are identical
to the first p characters of w2. Where w1 and w2 are
p/q-equivalent, the character strings s1 and s2 that begin
with the (p+1)th character of w1 and the (q+1)th
character of w2 respectively can be referred to as
pseudo-suffixes, as described above in relation to p-similarity.
[0077] Ideally, the implementation in Fig. 6 would 
return only those pseudo-suffix pairs that are valid, 
where a pair is valid if and only if it includes two actual 
suffixes of the language that describe the transition 
between two words in a derivational family of words in 
the language. In practice, however, an automatic tech- 
nique can only approximate this ideal. The implementa- 
tion in Fig. 6 uses p/q-equivalence and the number of 
occurrences of a pseudo-suffix pair to reduce the 
number of invalid pairs it returns. 
[0078] As p/q-equivalence (or p-similarity) of two
words increases, the probability increases that their
pseudo-suffix pair is valid. For example, the 2-similar
English words "divide" and "differ" have a pseudo-suffix
pair (vide+V, ffer+V), where +V stands for verb. But this
pair is invalid, because neither "vide" nor "ffer" is an
actual suffix of English, and also because "divide" and
"differ" do not belong to the same derivational family. On
the other hand, any pair of 10-similar English words is
very likely to have a valid pseudo-suffix pair because
the pseudo-suffixes are likely to be actual suffixes and
the words are likely to belong to the same derivational
family. But accepting only pseudo-suffix pairs from
10-similar words will eliminate many valid pseudo-suffix
pairs that only occur with shorter prefixes.
[0079] The implementation of Fig. 6 accepts a
pseudo-suffix pair only if the pair can be obtained from
a pair of words that are at least 5/5-equivalent. This
level of p/q-equivalence represents a good tradeoff
between the increased likelihood that a pair will be valid
if obtained from words that have high p/q-equivalence
and the risk of screening out valid pairs obtained from
words with lower p/q-equivalence. Experience suggests
that a slight change in p/q-equivalence (or p-similarity)
will not significantly change the resulting set of
pseudo-suffix pairs.

[0080] Similarly, as the number of occurrences of a
pseudo-suffix pair increases, the probability increases
that the pseudo-suffix pair is valid. A pair that occurs
only once or a few times may relate to an irregular
phenomenon or be invalid. The implementation of Fig. 6
accepts a pseudo-suffix pair only if the pair occurs with
at least two different prefix sets, where the prefixes in
each set are p/q-equivalent. This minimal value is
sufficient to screen out a large number of invalid pairs, but
remains quite loose so that the same implementation
can be applied to a variety of different languages.
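
The two acceptance thresholds can be sketched in Python as follows. This is a brute-force illustration under stated assumptions, not the FST-based implementation of Fig. 6: the names accepted_pairs, MIN_PREFIX and MIN_PREFIX_SETS are hypothetical, prefixes are compared for identity rather than equivalence, and every pair of words is compared directly.

```python
from collections import defaultdict
from itertools import combinations

MIN_PREFIX = 5       # words must share at least 5 leading characters
MIN_PREFIX_SETS = 2  # a pair must occur with at least 2 distinct prefixes


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def accepted_pairs(lexicon):
    """lexicon: iterable of (word, pos_tag) tuples, e.g. ("deploy", "+V").
    Returns {pseudo_suffix_pair: number_of_distinct_prefixes}."""
    prefixes = defaultdict(set)
    for (w1, t1), (w2, t2) in combinations(sorted(lexicon), 2):
        p = common_prefix_len(w1, w2)
        if p < MIN_PREFIX:
            continue  # prefix too short, pair is likely invalid
        pair = tuple(sorted([w1[p:] + t1, w2[p:] + t2]))
        prefixes[pair].add(w1[:p])
    return {pair: len(ps) for pair, ps in prefixes.items()
            if len(ps) >= MIN_PREFIX_SETS}
```

With a toy lexicon containing "deploy", "deployable", "enjoy" and "enjoyable", the pair (+V, able+AJ) is kept because it occurs with two prefixes, while the pair arising from "divide" and "differ" is rejected by the prefix-length test.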
[0081] This parameter can be better understood by
considering exemplary French suffix pairs extracted
from a French inflectional lexicon, each with its number
of occurrences:

ation+N : er+V / 782
+AJ : ment+AV / 460
eur+AJ : ion+N / 380
er+V : on+N / 50
sation+N : tarisme+N / 5

All of these pairs are valid except the last, which occurs,
for example, in "autorisation - autoritarisme"
(authorisation - authoritarianism). These pairs also show that a
valid pseudo-suffix pair does not always link words in
the same derivational family. For example, the pair
(er+V, on+N) yields a link between "saler" and "salon"
(in English, salt and lounge), even though the two words
refer to different concepts. Validity only requires that a
pseudo-suffix pair relate two words that are in the
same derivational family, a criterion which is met by the
pair (er+V, on+N) because it relates, for example, "friser"
and "frison" (in English, curl (+V) and curl (+N)).
[0082] As shown in Fig. 6, the implementation
begins in box 250 by obtaining an inflectional lexicon
FST data structure, such as the Xerox inflectional
lexicons for English and French available from InXight
Corporation of Palo Alto, California. These lexicons
conclude each acceptable character string with a
part-of-speech (POS) tag. The act in box 250 can include
obtaining a handle to access a lexicon that is already
stored in memory, as illustrated by inflectional lexicon
150 in Fig. 4.

[0083] The act in box 252 extracts the non-inflected
or lemma side of the FST data structure to obtain an
unfiltered FSM data structure that accepts all
character+POS strings that are acceptable to the lemma side
of the FST data structure. Because the FST data
structure typically accepts an infinite set of character+POS
strings, including strings of numbers, the unfiltered FSM
data structure typically also accepts an infinite set,
although it could be finite if the FST only accepts a finite
set of character+POS strings. The act in box 252 can be
implemented with a conventional automatic FSM
extraction utility. Ways to implement this and related
operations can be understood from US-A-5,625,554
and Kaplan, R.M., and Kay, M., "Regular Models of
Phonological Rule Systems", Computational Linguistics,
Vol. 20, No. 3, 1994, pp. 331-380 ("the Kaplan and Kay
article").

[0084] The act in box 254 then filters the unfiltered
FSM data structure from box 252 to produce a filtered
FSM data structure that accepts only suitable
character+POS strings, thus producing word list 152 in Fig. 4.
For example, character strings can be filtered out that
end with inappropriate POS tags, such as POS tags that
indicate numbers of various kinds or that indicate other
types of character strings that are inappropriate. In
languages in which words can be created by
concatenation, such as German, character strings may be filtered
out that exceed a maximum appropriate length or a
maximum appropriate number of concatenated parts,
such as four. The act in box 254 can be implemented by
composing the unfiltered FSM with a filtering FSM that
defines the conditions for including a character string in
the filtered FSM. The filtering FSM can be created using
conventional techniques, similar to those described in
US-A-5,625,554 and the Kaplan and Kay article.
[0085] The act in box 256 then uses the filtered
FSM data structure from box 254 to create a prefixing
FST data structure that outputs a normalized
prefix-equivalent in response to an acceptable word. In other
words, the input level of the prefixing FST accepts all of
the character+POS strings accepted by the filtered
FSM, and the output level provides, for each input
character+POS string, a normalized p-character string that
is p/q-equivalent to a prefix of the input character+POS
string for some value of q.

[0086] The act in box 256 can be implemented by
creating an intermediate FST that responds to a
character string by automatically normalizing characters until a
string of p normalized characters is obtained;
normalizing can include, for example, removing diacritical marks
and possibly making replacements of characters and
other modifications that produce equivalent prefixes. In
the current implementation, each normalizing operation
replaces a character with another character, thus
maintaining one-to-one or p/p-equivalence between
characters; the implementation could readily be extended to
normalize a doubled character as a single character or
vice versa or to make other normalizations that change
the number of characters.
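
One-to-one normalization of a prefix can be sketched in Python as follows. This is an illustration only: the described implementation uses an intermediate FST, whereas this sketch assumes that removing diacritical marks via Unicode decomposition is an acceptable stand-in. The names normalize_char and normalized_prefix are hypothetical.

```python
import unicodedata


def normalize_char(ch):
    """Replace one character with one character: its base letter
    without diacritical marks (assumed normalization rule)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0]  # drop any combining marks that follow


def normalized_prefix(word, p=3):
    """Return the normalized form of the first p characters of word."""
    return "".join(normalize_char(c) for c in word[:p])


# "d\u00e9put\u00e9" is the French word for "deputy", written with escapes.
print(normalized_prefix("d\u00e9put\u00e9", 3))
```

Because each input character yields exactly one output character, the number of characters is preserved, as in the current implementation described above.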

[0087] The intermediate FST can then be run on
the word list indicated by the filtered FSM to produce a
prefix FSM that indicates all the prefix-equivalents. The
prefix FSM can be composed with the filtered FSM to
produce the prefixing FST. In the prefixing FST, the
(p+1)th and following characters along each path at the
input level are paired with an epsilon at the output level,
meaning that no output is provided in response to those
characters.

[0088] The act in box 260 inverts the prefixing FST
from box 256 to produce an inverted prefixing FST with
prefix equivalents at the input level and with acceptable
words at the output level. The act in box 260 can be
implemented by simply switching the elements in the
label of each transition in the FST.
[0089] The act in box 262 composes the prefixing
FST from box 256 with the inverted prefixing FST from
box 260, performing the composition so that the
prefix-equivalents drop out and so that epsilons are added to
lengthen the shorter of an input character+POS string
to equal an output character+POS string, producing a
prefix-group FST. The prefix-group FST accepts the
word list of the filtered FSM from box 254 at its input
level and can output every word that had the same
prefix-equivalent in the prefixing FST from box 256. The
composition in box 262 can be performed with
conventional techniques similar to those described in
US-A-5,625,554 and the Kaplan and Kay article.
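
The effect of boxes 256, 260 and 262 can be approximated with ordinary Python dictionaries. This is a sketch under stated assumptions rather than a finite state implementation: the function name prefix_groups is hypothetical, prefixes are plain leading substrings with no normalization, and part-of-speech tags are omitted.

```python
from collections import defaultdict


def prefix_groups(words, p=5):
    """Map every word to all words that share its p-character prefix."""
    # Box 256 analogue: word -> prefix.
    prefixing = {w: w[:p] for w in words}
    # Box 260 analogue: invert the mapping, prefix -> words.
    inverted = defaultdict(set)
    for w, pre in prefixing.items():
        inverted[pre].add(w)
    # Box 262 analogue: compose the two so the prefix drops out.
    return {w: sorted(inverted[pre]) for w, pre in prefixing.items()}


groups = prefix_groups(["produce", "production", "product", "depart"])
print(groups["produce"])
```

The composed mapping relates "produce" to "produce", "product" and "production", while "depart" is related only to itself.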
[0090] The act in box 264 then uses the
prefix-group FST from box 262 to produce a list of prefix+suffix
alternatives. The act in box 264 can be implemented by
reading out the character pairs along each path of the
prefix-group FST and comparing the characters in each
pair until a mismatch is found. Before finding a
mismatch, only one character is saved for the prefix, but
after a mismatch is found, both characters of each pair
are saved for the suffix.

[0091] For example, for the path that relates the
verb "produce" to the noun "production", the
prefix+suffix alternatives could be represented by the following
string: (p, r, o, d, u, c, <e:t>, <+V:i>, <ε:o>, <ε:n>,
<ε:+N>), where "ε" characters have been included to
compensate for the difference in length, and where "+V"
and "+N" indicate a verb and a noun, respectively. The
prefix of the path is thus "produc", and the path has two
alternative suffixes, "e+V" and "tion+N".
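
The readout of a path into a prefix and two alternative suffixes can be sketched in Python. The representation of a path as a list of symbol pairs, the use of an empty string for epsilon, and the name split_path are assumptions made for illustration.

```python
EPS = ""  # stands in for the epsilon symbol


def split_path(pairs):
    """pairs: list of (input_symbol, output_symbol) along one path.
    Returns (prefix, suffix_of_input_word, suffix_of_output_word)."""
    prefix, s1, s2 = [], [], []
    mismatch = False
    for a, b in pairs:
        if not mismatch and a == b:
            prefix.append(a)      # before a mismatch: save one character
        else:
            mismatch = True       # after a mismatch: save both characters
            s1.append(a)
            s2.append(b)
    return "".join(prefix), "".join(s1), "".join(s2)


path = [("p", "p"), ("r", "r"), ("o", "o"), ("d", "d"), ("u", "u"),
        ("c", "c"), ("e", "t"), ("+V", "i"), (EPS, "o"), (EPS, "n"),
        (EPS, "+N")]
print(split_path(path))
```

For the "produce" and "production" path, the sketch yields the prefix "produc" with the alternative suffixes "e+V" and "tion+N".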
[0092] The list produced in box 264 thus implicitly
detects the p-similarity of each pair of related words. In
addition, the list can be produced in such a way that it is
alphabetized, so that larger groups of p-similar words
are adjacent within the list. As a result, the list can
readily be used to produce pseudo-families of p-similar
words, which can then be used as described below in
relation to Fig. 7.

[0093] The act in box 266 can then use the list
produced in box 264 to produce list 154 of suffix pairs and
frequencies. List 154 can be produced by converting the
suffix alternatives of each item on the list from box 264
to a suffix pair. If the suffix pair matches a suffix pair on
a list of previously obtained suffix pairs, the act in box
266 increments its frequency; if not, the suffix pair is
added to the list with a frequency of one. Because the
automatically obtained suffix pairs in list 154 may not all
include valid suffixes of a natural language set, they are
referred to below as "pseudo-suffix pairs".


C.3. Word Family Construction 

[0094] Fig. 7 illustrates in more detail how the act in 
box 204 in Fig. 5 can be implemented. 
[0095] Word family construction as in Fig. 7
addresses the problem of automatically grouping all the
words together that belong to the same derivational
family without grouping words together that belong to
different derivational families. Simple approaches that
have been proposed typically do not overcome this
problem.

[0096] One simple approach is to add words to a
family that have some level of p-similarity with words in
the family and that relate to words in the family through
suffix pairs. For example, given the two English suffix
pairs (+V, able+AJ) and (+V, ment+N), we can first
group the 6-similar words "deploy" and "deployable",
and then add to this family the word "deployment". But
this approach will also group "depart" and "department"
into the same family.

[0097] In general, then, suffix pairs can relate words
that do not belong to the same derivational family. This
problem is general to any stemming procedure, as
illustrated by the English string "ment" at the end of a word,
which may or may not correspond to a suffix. This
phenomenon is frequently cited by opponents of fully
automatic stemming procedures, and it is typically overcome
with a list of exceptions, including exceptions to control
removal of the suffix "ment".

[0098] The intuitive premise behind word family
construction as in Fig. 7 is that suffixes form families,
and that the use of a suffix usually coincides with the
use of other suffixes in the same family, while suffixes
from different families do not co-occur. For example, if
the string "ment" is not a suffix, as in "department", then
it is likely that the word obtained after removal of the
string, i.e. "depart", will support suffixes that do not
usually co-occur with "ment", such as "ure" which produces
"departure". This premise is supported by the manually
created suffix families disclosed in the Debili thesis,
discussed above.

[0099] To automatically construct relational families
of words based on families of suffixes, the
implementation of Fig. 7 uses a hierarchical agglomerative
clustering approach, allowing an element to be added to a
cluster in accordance with a similarity measure.
Specifically, Fig. 7 illustrates a complete link approach of the
type disclosed by Rasmussen, E., "Clustering
Algorithms", in Frakes, W.B. and Baeza-Yates, R., eds.,
Information Retrieval-Data Structures and Algorithms,
Prentice Hall, Englewood Cliffs, New Jersey, 1992, pp.
419-436. The complete link approach takes into
account the relation between an element and all the
elements that are already in the cluster to which it may be
added. The similarity measure used is a measure of
suffix similarity which is initially based on a frequency
from box 202.

[0100] Suffix similarity alone, however, is not
sufficient to relate two words. For example, the suffix pair
(able+AJ, ingly+AV) would, without more, relate the
words "enjoyable" and "deploringly", which are
unrelated. Therefore, the implementation of Fig. 7 also uses
prefix criteria to determine whether words should be
grouped together.

[0101] A first prefix criterion is applied to obtain
pseudo-families of words that have an appropriately
chosen level of similarity or equivalency. It has been
found that 3-similarity or 3/q-equivalency are
appropriate criteria to obtain pseudo-families of words for
languages such as English and French, and appear to
preserve valid relations between words while being
language independent.

[0102] An extract of an exemplary 3-similar English
pseudo-family includes the following words:
deployability, deployable, deploy, deployer, department, departer,
depart, departmental, deprecate, deprecation, deprive,
depriver, deplore, deplorable, deploringly.
[0103] A second prefix criterion is applied implicitly 
in finding pairs of words that are related by suffix pairs. 
In other words, a pair of words are only related by a suf- 
fix pair if removing one of the suffixes from one of the 
words, obtaining one or more equivalents of the remain- 
ing prefix, and replacing the removed suffix with the 
other suffix on one of the equivalents produces the 
other word, which can only be true if the prefix that pre- 
cedes the suffix in one word is equivalent to the prefix 
that precedes the suffix in the other word. 
[0104] The implementation of Fig. 7 begins in box
300 when a call is received to perform clustering on a
set of words such as word list 152 using a set of
pseudo-suffix pairs and their frequencies such as list
154, which can be obtained as in Fig. 6. Before
proceeding further, the implementation could obtain all of
the 3-similar or 3/q-equivalent pseudo-families in the set
of words, or each pseudo-family can be obtained as it is
needed.

[0105] The act in box 302 begins an outer iterative 
loop that handles each of the pseudo-families in turn. 
The act in box 304 obtains the next pseudo-family, such 
as by accessing a previously obtained pseudo-family or 
by obtaining one anew. 

[0106] Then the act in box 310 begins a first inner 
iterative loop that handles each word pair that can be 
obtained from the pseudo-family. In box 312, each itera- 
tion of the loop sets the suffix-similarity of a word pair. 
The suffix similarity is the highest frequency of the 
pseudo-suffix pairs received in box 300 that relate the 
two words, whether with identical prefixes or acceptable 
equivalent prefixes. If none of the pseudo-suffix pairs 
relate the two words, the suffix similarity is zero. 
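
The suffix similarity set in box 312 can be sketched in Python. This is a simplified illustration: the name suffix_similarity is hypothetical, the frequencies in the usage lines are invented, part-of-speech tags are ignored, and only identical prefixes (not equivalent ones) are accepted.

```python
def suffix_similarity(w1, w2, pair_freq):
    """Return the highest frequency among pseudo-suffix pairs that
    relate w1 and w2 through an identical prefix, or 0 if none does.
    pair_freq: dict mapping (suffix_a, suffix_b) -> frequency."""
    best = 0
    for (s1, s2), freq in pair_freq.items():
        # A pair is unordered, so try both orientations.
        for a, b in ((s1, s2), (s2, s1)):
            if w1.endswith(a) and w2.endswith(b):
                stem1 = w1[:len(w1) - len(a)]
                stem2 = w2[:len(w2) - len(b)]
                if stem1 == stem2:
                    best = max(best, freq)
    return best


# Invented frequencies, for illustration only.
freqs = {("", "ment"): 40, ("able", "ingly"): 7}
print(suffix_similarity("deploy", "deployment", freqs))
print(suffix_similarity("enjoyable", "deploringly", freqs))
```

The second call returns zero because the prefixes "enjoy" and "deplor" differ, even though the suffixes match a known pair.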
[0107] The act in box 320 begins a second inner 
iterative loop that handles each word pair with a suffix 
similarity greater than zero, continuing until the test in 
box 320 determines that none of the word pairs have a 
suffix similarity greater than zero. The act in box 322 
begins each iteration by finding the pair of words that 
are related by a suffix pair with the greatest suffix simi- 
larity. The iteration then branches in box 330 based on 
previous clustering, if any. 

[0108] If both words of the pair have already been 
clustered, the two clusters that include them are 
merged, in box 332. If one of the words has been clus- 
tered and the other has not, the non-clustered word is 
added to the cluster that includes the other word, in box 
334. And if neither of the words has been clustered, a 
new cluster is created that includes both of them, in box 
336. 

[0109] Each iteration also includes adjustment of
suffix similarities of other words with respect to the
words of the pair found in box 322. As shown in box 340,
this is done for every word. In box 342, the greater of a
word's suffix similarities with the words of the pair is
reduced to zero. This ensures that the word will only be
added to the cluster that includes the words of the pair if
the lesser suffix similarity is relatively large, thus
enforcing the complete link approach.
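
The second inner loop of Fig. 7 can be sketched in Python as follows. This is one reading of boxes 320 through 342, not a definitive implementation: the name cluster_family is hypothetical, the similarity values in the usage lines are invented, and the sketch assumes that the similarity of the selected pair is itself set to zero once the pair has been handled, which the text does not state explicitly.

```python
def cluster_family(words, sim):
    """words: list of words in one pseudo-family.
    sim: dict mapping frozenset({w1, w2}) -> suffix similarity.
    Returns a list of relational families (sorted lists of words)."""
    sim = dict(sim)      # work on a copy
    cluster_of = {}      # word -> set shared by all cluster members
    while True:
        live = {k: v for k, v in sim.items() if v > 0}
        if not live:     # box 320: no similarity above zero remains
            break
        pair = max(live, key=live.get)           # box 322
        a, b = sorted(pair)
        sim[pair] = 0    # assumption: the chosen link is consumed
        ca, cb = cluster_of.get(a), cluster_of.get(b)
        if ca is not None and cb is not None:    # box 332: merge
            if ca is not cb:
                ca |= cb
                for w in cb:
                    cluster_of[w] = ca
        elif ca is not None:                     # box 334: add b
            ca.add(b)
            cluster_of[b] = ca
        elif cb is not None:                     # box 334: add a
            cb.add(a)
            cluster_of[a] = cb
        else:                                    # box 336: new cluster
            cluster_of[a] = cluster_of[b] = {a, b}
        for x in words:                          # boxes 340 and 342
            if x in (a, b):
                continue
            ka, kb = frozenset((x, a)), frozenset((x, b))
            greater = ka if sim.get(ka, 0) >= sim.get(kb, 0) else kb
            sim[greater] = 0
    families, done = [], set()
    for w in words:      # box 350: unclustered words form own family
        if w not in done:
            members = cluster_of.get(w, {w})
            families.append(sorted(members))
            done |= members
    return families


def key(w1, w2):
    return frozenset((w1, w2))


# Invented similarities, for illustration only.
example = {key("depart", "departure"): 30,
           key("depart", "department"): 40,
           key("department", "departmental"): 50}
print(cluster_family(
    ["depart", "departure", "department", "departmental"], example))
```

With these invented values, "department" and "departmental" are clustered first; the link between "depart" and "department" is then reduced to zero because "depart" has no link to "departmental", so "depart" ends up in a separate family with "departure".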
[0110] When all the suffix similarities reach zero for
one of the pseudo-families, the act in box 350 saves
each cluster as a relational family, and also saves each
unclustered word in a respective one-word relational
family. The following, separated by lines of asterisks,
are exemplary clusters obtained from the English
pseudo-family of 3-similar words beginning with dep:

*****
depletability depletable depletableness depletably
deplete depleter depletion
*****
deployability deployable deployableness deployably
deploy deployer deployment
*****
depressant depress depresser depressingly
depression depressor depressive depressiveness
depressively
*****
deprecate deprecation deprecator deprecative
deprecativeness deprecatively deprecativity deprecatory
deprecatory deprecatingly
*****
deposability deposable deposableness deposably
depose deposer deposal
*****
department departmentally departmental
departmentalness departmental
*****
depart departure departer

[0111] When relational families have been obtained
for all the pseudo-families, the act in box 352 returns all
the relational families from all the pseudo-families, such
as in the form of list 156 in Fig. 4.

C.4. Conversion to FST

[0112] In the current implementation, a
representative word can be automatically chosen for each
relational family returned in box 352. The representative
word can be the shortest word in the family or, if more
than one word has the shortest length, an arbitrarily
chosen word with the shortest length, such as the verb,
or a randomly chosen verb if more than one verb has
the shortest length. Or the representative could be the
first word of the shortest length words in alphabetical
order.
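
The last of these options, the first of the shortest words in alphabetical order, can be sketched in one line of Python. The function name representative is hypothetical.

```python
def representative(family):
    """Return the shortest word; ties are broken alphabetically."""
    return min(family, key=lambda w: (len(w), w))


print(representative(["deployment", "deploy", "deployer", "deployable"]))
```

This choice is deterministic, unlike an arbitrary or random choice among the shortest words.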

[0113] All the relational families of a language can
then be automatically converted into a finite state
transducer (FST) of the type disclosed in US-A-5,625,554.
Each transition of the FST can have two levels, a first of
which indicates a character of a word and a second of
which indicates a character of the representative of the
family that includes the word. Such an FST can be
produced using techniques similar to those disclosed in
Karttunen, L., "Constructing Lexical Transducers",
Proceedings of the 15th International Conference on
Computational Linguistics, Coling 94, August 5-9, 1994,
Kyoto, Japan, Vol. 1, pp. 406-411.

C.5. Lookup 

[0114] The FST produced as described above can
be used in various ways, some of which are similar to
techniques described in US-A-5,625,554. For example,
it can be accessed with the characters of a word to
obtain a representative of the relational family that
includes the word. Or it can be accessed with the
representative of a relational family to obtain the words in the
family. Or both of these techniques can be performed in
sequence to use a word to obtain the words in its family.
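
The three kinds of access can be imitated with two Python dictionaries. This is only a functional stand-in and does not model an FST: the names build_word_index and family_of are hypothetical, and the families are assumed to be given as a mapping from each representative to its words.

```python
def build_word_index(families):
    """families: dict mapping representative -> list of family words.
    Returns a dict mapping every word to its representative."""
    return {w: rep for rep, ws in families.items() for w in ws}


def family_of(word, families, word_index):
    """Use a word to obtain the words in its family (both lookups in
    sequence). Unknown words yield an empty list."""
    rep = word_index.get(word)
    return families[rep] if rep is not None else []


families = {"deploy": ["deploy", "deployer", "deployment"],
            "depart": ["depart", "departer", "departure"]}
index = build_word_index(families)
print(index["deployment"])                      # word -> representative
print(families["depart"])                       # representative -> words
print(family_of("departure", families, index))  # word -> family words
```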
[0115] The FST can also be used to obtain a
representative of a relational family that is likely to relate to an
unknown word, using the techniques disclosed in
copending, coassigned U.S. Patent Application No.
09/XXX.XXX (Attorney Docket No. R/98022Q), entitled
"Identifying a Group of Words Using Modified Query
Words Obtained from Successive Suffix Relationships",
incorporated herein by reference.

C.6. Results 

[0116] An FST for English words produced as
described above has been compared with the
derivational lexicon described in Xerox Corporation, Xerox
Linguistic Database Reference (English Version 1.1.4
ed.), 1994 ("the Xerox Database"), considered to be a
very high quality derivational lexicon. Initially,
derivational families were extracted from the Xerox Database
and compared to the relational families in the FST. The
comparison was based on the number of words which
would have to be moved in or out of a relational family to
obtain a counterpart derivational family. Due to
overstemming errors, in which unrelated words are in the
same relational family, and understemming errors, in
which related words are in different relational families,
there are often differences between the relational and
derivational families.

[0117] Counterpart relational and derivational
families were identified by assuming that a word wi is
correctly placed in a relational family ri if the relational
family includes most of the words of the derivational
family of wi and the derivational family of wi includes
most of the words of the relational family ri. In other
words, a relational family and derivational family are
counterparts only if they are mostly the same. Words
that are not in both counterparts must be moved to
make the relational and derivational families the same,
and the ratio of words that need not be moved, summed
over all the derivational families, to the total number of
words in all the derivational families is a distance
measure between the two sets of families.
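
The ratio can be sketched in Python under explicit assumptions: "most" is read as strictly more than half, a word that appears in no relational family is treated as a one-word family, and the name agreement_ratio is hypothetical. The families in the usage lines are invented for illustration.

```python
def agreement_ratio(relational, derivational):
    """relational, derivational: lists of families (lists of words).
    Returns the share of words that need not be moved."""
    rel_of = {w: frozenset(f) for f in relational for w in f}
    correct = total = 0
    for fam in derivational:
        d = frozenset(fam)
        for w in d:
            total += 1
            r = rel_of.get(w, frozenset([w]))
            common = len(r & d)
            # w is correctly placed if each family holds most of the other.
            if common > len(d) / 2 and common > len(r) / 2:
                correct += 1
    return correct / total


rel = [["depart", "departure"], ["departer"], ["deploy", "deployment"]]
der = [["depart", "departure", "departer"], ["deploy", "deployment"]]
print(agreement_ratio(rel, der))
```

In this toy case only "departer" would have to be moved, so four of the five words count as correctly placed.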



[0118] As a preliminary test, relational families were
obtained from the Xerox inflectional lexicon for English
using three different hierarchical agglomerative
clustering techniques: the complete link technique described
above; a single link technique that makes no use of
suffix family but rather adds an element to a cluster
whenever a link, as detected in box 322 in Fig. 7, exists
between the element and one of the elements of the
cluster; and a group average technique that makes
partial use of the notion of suffix family by adding an
element to a cluster on the basis of the average link
between the element and the elements of the cluster,
where the average link can be obtained from the suffix
similarities obtained in box 312 in Fig. 7. When the three
sets of relational families were compared with the
derivational families from the Xerox Database in the manner
described above, the following ratios were obtained:
single link relational families, 0.47; group average
relational families, 0.77; and complete link relational
families, 0.835. These results confirm that the notion of
suffix families is effective, and validate the use of the
complete link technique.

[0119] The relational families obtained with the
complete link technique were also compared with two
well-known English stemmers: the SMART stemmer
disclosed in Salton, G., and McGill, M.J., Introduction to
Modern Information Retrieval, New York: McGraw-Hill,
1983, pp. 130-136, and the stemmer described in Porter,
M.F., "An algorithm for suffix stripping", Program, Vol.
14, no. 3, July 1980, pp. 130-137. The SMART
stemmer, which includes a set of rules with conditions of
applications, was first implemented twenty-five years
ago and has undergone much manual revision by
generations of information retrieval researchers. Porter's
stemmer, currently the most used stemmer, is compact,
with only 60 different suffixes, and is easy to implement
even though it performs nearly as well as other, more
complex stemmers.

[0120] The comparison with the SMART stemmer
and Porter's stemmer was done by first constructing
families with each stemmer and then comparing the
constructed families with the derivational families from
the Xerox Database in the manner described above.
The families were constructed by submitting the whole
lemmatized lexicon of the Xerox Database to each
stemmer and grouping

The following ratios were obtained: SMART
stemmer, 0.82; Porter's stemmer, 0.65.
[0121] These results indicate that the SMART
stemmer performs better than Porter's stemmer, but
that the relational families from the complete link
technique perform at least as well as the SMART stemmer
and better than Porter's stemmer. In other words, the
technique described above can produce relational
families closer to actual derivational families than the
families constructed from the two stemmers. This strength of
the technique apparently results from its ability to
distinguish between groups that may not be distinguished by
exceptions in the SMART stemmer and Porter's
stemmer. The example above, in which the technique can
distinguish the relational family of "depart" from the
relational family of "department", is apparently
representative of a number of similar cases that are not covered by
exceptions in the stemmers. Another advantage of the
implementation described above is that the
representative of each relational family is always an acceptable
word.

[0122] An FST for French words produced as
described above from the Xerox inflectional lexicon for
French has also been tested, though without the benefit
of a French derivational lexicon comparable to the
Xerox Database. The test was made in the framework of
an information retrieval task, an environment in which
differences between stemmers tend to be smoothed.
The aim of the test was to determine whether the FST
could increase the performance of an information
retrieval system operating in French.
[0123] The test used the data, i.e. document set,
topic set, and relevance judgment set, provided within
the AMARYLLIS project, described in Coret, A., Kremer,
P., Landi, B., Schibler, D., Schmitt, L., and Viscogliosi,
N., "Accès à l'information textuelle en français: Le cycle
exploratoire Amaryllis", 1ères JST 1997 FRANCIL de
l'AUPELF-UREF, 15-16 avril 1997, Avignon, France, pp.
5-8, a project that evaluates information retrieval
systems on French data. The test ran three different
indexing schemes: the first with no treatment and
considering words as index; the second replacing each
word with its lemma from the Xerox inflectional lexicon;
and the third replacing each word with the
representative of its relational family from the FST. The average
precision increased over the first scheme by
approximately 16.5% with the second scheme and by
approximately 18% with the third scheme.

C.7. Variations 

[0124] The implementations described above could
be varied in many ways within the scope of the
invention.

[0125] The implementations described above have
been successfully executed on Sun SPARC
workstations, but implementations could be executed on other
machines.

[0126] The implementations described above have
been successfully executed using the C programming
environment on the Sun OS platform, but other
programming environments and platforms could be used.
[0127] The implementations described above
perform clustering over relation values that are suffix
similarities measured by frequency of occurrence of
pseudo-suffix pairs. The invention could be
implemented to perform clustering over any other
automatically obtained relation value, whether a distance, a
similarity or dissimilarity coefficient, or any other
appropriate value, such as mutual information. Further, a
relation value used in automatic clustering could indicate a
relation between more than two pseudo-suffixes.
[0128] The implementations described above
include parts of speech in each pseudo-suffix pair, but
this is optional. The invention could be implemented
without taking part of speech into account.
[0129] The implementations described above use a
language's inflectional lexicon to obtain relational
families of words, but other information about a language
could be used to obtain relational families, such as a list
or other data structure indicating words that are
acceptable in the language.

[0130] The implementations described above use
various clustering techniques, including the complete
link technique, the group average technique, and the
single link technique, to produce mutually exclusive
clusters. Other clustering techniques could be used
within the scope of the invention, including double link
and other n-link clustering and also including
techniques in which a word can be within more than one
cluster.

[0131] The implementations described above per- 
form clustering on 3-similar pseudo-families obtained 
from a set of words, but clustering could be performed 
on the entire word set or on pseudo-families meeting 
other criteria. 

[0132] The implementations described above yield 
an FST converted directly from automatically obtained 
relational families, but automatically obtained relational 
families in accordance with the invention could also be 
manually "cleaned" or modified to obtain higher quality 
derivational families more rapidly than would be possi- 
ble with manual techniques. Further, converting groups 
of words obtained in accordance with the invention into 
another form is optional, and groups of words could be 
stored in many other types of data structures in addition 
to an FST, such as in relational databases. 
[0133] Also, additional types of information about 
automatically obtained relational families of words could 
be obtained and used in various ways. For example, for 
each relational family, a data structure such as a tree 
that summarizes the suffix relations within the family 
could be automatically obtained. For the relational fam- 
ily {deploy, deployment, deployer, deployable, deploya- 
bility}, the top level of such a tree could be (Verb), with a 
link that adds the suffix "-ment" to obtain a (Noun) child, 
with a link that adds the suffix "-er" to obtain another 
(Noun) child, and with a link that adds the suffix "-able" 
to obtain an (Adjective) child; the (Adjective) child could, 
in turn, have a link that adds the suffix "-ity" to change 
"able" to "ability" to obtain a (Noun) grandchild. The 
resulting tree should approximate the ideal derivational 
tree joining the words of a derivational family, thus pro- 
viding access to potential suffixes of a language, their 
morphotactics, and their paradigmatic use. Frequencies 
of occurrence of the trees can also be obtained, and 
more probable suffix relation trees can thus be identi- 
fied. 
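
The suffix relation tree described for the family {deploy, deployment, deployer, deployable, deployability} could be represented as in the following Python sketch. The sketch is illustrative only. The class `Node`, the functions `build_deploy_tree` and `expand`, and the encoding of each link as a (strip, add) pair are assumptions, not structures taken from the implementations described above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A part of speech plus outgoing suffix links to child nodes."""
    pos: str
    links: list = field(default_factory=list)  # entries: (strip, add, child)

    def link(self, strip, add, child):
        """Attach `child`, reached by replacing the ending `strip` with `add`."""
        self.links.append((strip, add, child))
        return child


def build_deploy_tree():
    """Tree for the 'deploy' family as described in the text."""
    verb = Node("Verb")
    verb.link("", "ment", Node("Noun"))
    verb.link("", "er", Node("Noun"))
    adjective = verb.link("", "able", Node("Adjective"))
    # "-ity" changes "able" to "ability".
    adjective.link("able", "ability", Node("Noun"))
    return verb


def expand(node, word):
    """Yield (word, part of speech) for the node and all of its descendants."""
    yield word, node.pos
    for strip, add, child in node.links:
        if word.endswith(strip):
            base = word[: len(word) - len(strip)]
            yield from expand(child, base + add)


if __name__ == "__main__":
    for form, pos in expand(build_deploy_tree(), "deploy"):
        print(form, pos)
```

Because the tree stores only parts of speech and suffix operations, the same tree shape can be matched against other families, which is one way frequencies of occurrence of trees could be tallied.
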






[0134] To develop a derivational lexicon from rela- 
tional families, a lexicographer could view the results 
obtained in this manner, review the automatically identi- 
fied derivational processes implicit in the results for 
validity, make modifications in accordance with the 
actual derivational processes of a language, and gener- 
ate a resulting set of modified relational families for fur- 
ther study and possible use. Modified relational families 
obtained in this way may provide a better approximation 
to derivational families than automatically obtained rela- 
tional families. A lexicographer might therefore be able 
to develop a derivational lexicon more quickly without 
sacrificing accuracy, because the lexicographer can 
focus on irregularities of the language under considera- 
tion. 

[0135] The implementations described above use 
an FST that maps from a word to a representative that 
is a shortest length word in the same relational family 
and from a representative to a list of words in the family. 
An FST could, however, map to any other appropriate 
representative of a family, such as an extracted root or 
even a number or other value that serves as an index. 
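
The two mappings described above, from a word to its representative and from a representative to the words of its family, can be illustrated with ordinary dictionaries. The Python sketch below is not an FST and is not the implementation described above. The function name `build_lexicon` and the alphabetical tie-break among equally short words are assumptions.

```python
def build_lexicon(families):
    """Return (word -> representative, representative -> sorted family list).

    The representative is a shortest word of the family. Ties are broken
    alphabetically, which is an assumption made for this sketch.
    """
    to_representative = {}
    to_family = {}
    for family in families:
        members = sorted(family)
        representative = min(members, key=lambda w: (len(w), w))
        to_family[representative] = members
        for word in members:
            to_representative[word] = representative
    return to_representative, to_family


if __name__ == "__main__":
    families = [
        {"deploy", "deployment", "deployer", "deployable", "deployability"},
    ]
    to_rep, to_fam = build_lexicon(families)
    print(to_rep["deployability"])
    print(to_fam[to_rep["deployability"]])
```

Replacing the `min` call with a function that returns an extracted root or an integer index would give the alternative representatives mentioned above.
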
[0136] The implementations described above 
employ relations between suffixes and do not take rela- 
tionships between prefixes into account in grouping 
words, but relationships between prefixes could be 
taken into account by removing certain prefixes before 
grouping words and might be taken into account in other 
ways. 

[0137] The implementations described above have 
been applied to English and French, and values of 
parameters such as p-similarity and p/q equivalence, 
although chosen for generality, have only been success- 
fully used with English and French. The invention can 
be applied to languages other than English and French 
and values of parameters could be modified as neces- 
sary for greater generality or for optimal results with any 
specific language. 

[0138] In the implementations described above, 
acts are performed in an order that could be modified in 
many cases. 

[0139] The implementations described above use 
currently available computing techniques, but could 
readily be modified to use newly discovered computing 
techniques as they become available. 

D. Applications 

[0140] The invention can be applied to automati- 
cally produce an approximation of a derivational lexicon. 
The result, referred to herein as a "relational lexicon", 
can then be used as a derivational lexicon would be 
used. For example, from an input word, the relational 
lexicon can obtain another word that represents a group 
or family of related words, a process sometimes 
referred to as "stemming", "normalization", or "lemmati- 
zation". Unlike an inflectional lexicon, the relational lexi- 
con can stem, normalize, or lemmatize across parts of 
speech. To improve information retrieval performance, 
this can be done in advance for each word in a database 
being searched and, at the time of search, for each 
word in a query. A search can then be performed by 
comparing the representative words, to find items in the 
database that relate to the query. 
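
The retrieval use described above, in which database words are normalized in advance and query words at search time, could be sketched as follows in Python. The sketch is illustrative only. The functions `normalize`, `build_index`, and `search`, the whitespace tokenization, the requirement that every query word match, and the sample mapping and documents are all assumptions.

```python
from collections import defaultdict


def normalize(word, to_representative):
    """Map a word to its family representative; unknown words map to themselves."""
    return to_representative.get(word, word)


def build_index(documents, to_representative):
    """Done in advance: index each document under its normalized words."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[normalize(token, to_representative)].add(doc_id)
    return index


def search(query, index, to_representative):
    """At search time: normalize each query word, then compare representatives."""
    hits = [
        index.get(normalize(token, to_representative), set())
        for token in query.lower().split()
    ]
    return set.intersection(*hits) if hits else set()


if __name__ == "__main__":
    # Hypothetical word-to-representative mapping from a relational lexicon.
    to_rep = {"deploy": "deploy", "deployment": "deploy", "deployable": "deploy"}
    docs = {
        1: "rapid deployment of equipment",
        2: "a deployable antenna",
        3: "suffix clustering",
    }
    index = build_index(docs, to_rep)
    print(search("deploy", index, to_rep))
```

Here the query "deploy" matches documents containing the noun "deployment" and the adjective "deployable", which illustrates normalization across parts of speech.
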
[0141] The relational lexicon can also be used to 
generate a group or family of words from one of the 
words in the group or family. 
[0142] By analogy to an inflectional lexicon, the 
relational lexicon can also be used to go from one part 
of speech to another within a derivational family, a use- 
ful capability in application areas such as machine 
translation. For example, if two languages have counter- 
part multi-word expressions in which a word or subex- 
pression in a first language has a different part of 
speech than the counterpart word or subexpression in a 
second language, translation could be accomplished by 
first obtaining a counterpart word or subexpression in 
the second language that has the same part of speech 
as the word or subexpression in the first language. 
Then, the relational lexicon could be used to derive the 
counterpart word or subexpression in the second lan- 
guage that has the appropriate part of speech. 
[0143] The relational lexicon can also be used as 
part of a terminology extractor from monolingual and 
multilingual points of view, such as to extract indexes 
from a text for accessing a thesaurus or to perform other 
types of controlled indexing. 

E. Miscellaneous 

[0144] The invention has been described in relation 
to software implementations, but the invention might be 
implemented with specialized hardware. 
[0145] The invention has been described in relation 
to implementations using serial processing techniques. 
The invention might also be implemented with parallel 
processing techniques. 

Claims 

1. A method of grouping a set of words that may occur 
in a natural language set, comprising: 

automatically obtaining suffix relation data indi- 
cating a relation value for each of a set of rela- 
tionships between suffixes that occur in the 
natural language set; and 

automatically clustering the words in the set of 
words using the relation values from the suffix 
relation data, to obtain group data indicating 
groups of words; two or more words in a group 
having suffixes as in one of the relationships 
and, preceding the suffixes, equivalent sub- 
strings. 

2. The method of claim 1 in which the relation value for 
a relationship is its frequency of occurrence. 

3. The method of claim 1 or 2 in which the relation- 
ships between suffixes are pairwise relationships. 

4. The method of claim 3 in which the relation value of 
each pairwise relationship is the number of pairs of 
words that are related to each other by the pair of 
suffixes in the relationship. 

5. The method of claim 3 in which the act of automati- 
cally clustering the words comprises: 

obtaining, for each of a set of pairs of words, a 
pairwise similarity value based on the relation 
value of a suffix pair, if any, that relates the 
words to each other; and 

performing automatic clustering using the pair- 
wise similarity values for the pairs of words. 

6. The method of claim 5 in which the act of perform- 
ing automatic clustering performs complete link 
clustering. 

7. The method of claim 5 or 6 in which the pairwise 
similarity value for a pair of words is equal to the 
greatest relation value of the relationships between 
suffixes that relate the words in the pair to each 
other. 

8. The method of claim 3 in which the natural lan- 
guage set includes one natural language and the 
act of automatically obtaining suffix relation data 
comprises: 

using a lexicon for the language to obtain a 
word list indicating the set of words; 

using the word list to obtain suffix pair data indi- 
cating pairs of suffixes that relate words in the 
set of words to each other; and 

for each pair of suffixes indicated by the suffix 
pair data, obtaining a relation value indicating a 
number of times the suffix pair occurs in the set 
of words. 

9. The method of claim 8 in which the lexicon is an 
inflectional lexicon for the language. 

10. The method of claim 8 or 9 in which the suffix pair 
data further indicate, for each suffix in a pair, a part 
of speech; the relation value indicating the number 
of times the suffixes in the suffix pair occur in the 
set of words with the indicated parts of speech. 

11. The method according to claims 1 to 10, further 
comprising: 

automatically obtaining, for each group of 
words indicated by the group data, a represent- 
ative; and 

automatically producing a data structure that 
can be accessed with a word in a group to 
obtain the group's representative. 

12. The method of claim 11 in which the act of automat- 
ically obtaining a representative selects the short- 
est word in a group as the representative. 

13. The method of claim 11 or 12 in which the data 
structure can also be accessed with a group's rep- 
resentative to obtain a list of words in the group. 

14. The method of claim 11, 12 or 13 in which the data 
structure is a finite state transducer data structure. 

15. A system for grouping a set of words that occur in a 
natural language, comprising: 

memory for storing data; and 
a processor connected for accessing the mem- 
ory; the processor operating to: 

automatically obtain suffix relation data 
indicating a relation value for each of a set 
of relationships between suffixes that 
occur in the natural language; the proces- 
sor storing the suffix relation data in mem- 
ory; and 

automatically cluster the words in the set 
using the relation values from the suffix 
relation data, to obtain group data indicat- 
ing groups of words; two or more words in 
a group having suffixes as in one of the 
relationships and, preceding the suffixes, 
equivalent substrings; the processor stor- 
ing the group data in memory. 

16. The system of claim 15, further comprising: 

an inflectional lexicon stored in memory; 

the processor, in automatically obtaining 
suffix relation data, accessing the inflec- 
tional lexicon in memory. 

17. The system of claim 15 or 16 in which the processor 
further operates to: 

automatically obtain, for each group of words 
indicated by the group data, a representative; 
and 

automatically produce a data structure that can 
be accessed with a word in a group to obtain 
the group's representative; the processor stor- 
ing the data structure in memory. 

18. The system of claim 17, further comprising a stor- 
age medium access device for accessing a storage 
medium; the processor being connected for provid- 
ing data to the storage medium access device; the 
processor further operating to provide the data 
structure to the storage medium access device; the 
storage medium access device storing the data 
structure on the storage medium. 

19. An article of manufacture produced by the system 
of claim 18; the article of manufacture comprising: 

the storage medium; and 

the data structure stored on the storage 
medium. 

20. The system of claim 17, 18 or 19 in which the proc- 
essor is further connected for establishing connec- 
tions with machines over a network; the processor 
operating to: 

establish a connection to a machine over the 
network; and 

transfer the data structure to the machine over 
the network. 
